Unless you are a complete stranger to the world of deep learning, you have most likely heard the term activation function. So what are activation functions? Put simply, an activation function maps a neuron's inputs to its output in an artificial neural network.
You might wonder which activation function you should use for your neural network, because there are so many out there. By the end of this article, I'll make sure all your doubts related to activation functions are gone. Specifically, we will learn:
Why activation functions are required
The different activation functions that are commonly used
Which one you should use for your deep learning model
A practical example of using the activation functions discussed here on a regression problem.
Without any further delay, let’s get started with Why.
Artificial neural networks are roughly based on our brains’ neural nets, in the way that multiple nodes (or neurons) are interconnected and signals can pass through these nodes. It’s the hierarchical structure that gives us such amazing results. The whole idea behind activation functions is to roughly model the ways neurons communicate in the brain with each other.
Now let's come to the original question. Couldn't we just multiply the input by the weight of each neuron, add a bias and propagate that result forward? No, because activation functions do something very important: they introduce non-linearity into our network. Why is this a good thing? A linear function is a polynomial of degree one, like y = 2x or y = x. If you were to plot these functions on a graph, they would always form a straight line. If we added more dimensions, they would form a plane or a hyperplane, but their shape would always be perfectly flat, with no curves of any kind. That's why we call them linear. Linear equations are easy to solve, but they are limited in their complexity. We want our neural networks to be universal function approximators, meaning they should be able to approximate essentially any function. Hence, we need a way to compute not just linear functions but non-linear ones as well. If we didn't use non-linear activation functions, then no matter how many layers our neural network had, it would still behave just like a single-layer network, because the composition of linear functions is itself a linear function. That is not expressive enough to model many kinds of data.
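To make this concrete, here is a minimal sketch (the layer sizes and random input are arbitrary, chosen only for illustration) showing that two stacked linear layers with no activation in between collapse into a single linear map, while inserting a non-linearity breaks that equivalence:

import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(5, 4)                       # a batch of 5 inputs with 4 features each

# Two linear layers stacked with no activation in between
fc1 = nn.Linear(4, 8, bias=False)
fc2 = nn.Linear(8, 3, bias=False)
two_layer_out = fc2(fc1(x))

# The same mapping collapses into a single linear transform with weight W2 @ W1
combined_weight = fc2.weight @ fc1.weight
one_layer_out = x @ combined_weight.t()

print(torch.allclose(two_layer_out, one_layer_out, atol=1e-6))   # True: nothing is gained by stacking

# With a non-linearity in between, the network is no longer a single linear map
nonlinear_out = fc2(torch.relu(fc1(x)))
print(torch.allclose(nonlinear_out, one_layer_out, atol=1e-6))   # False in general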
Now that we know why activation functions are required let’s discuss some of the most popular ones.
The sigmoid function has the mathematical form f(x) = 1 / (1 + e^(-x)). It takes a number and squashes it into the range between 0 and 1.
Image courtesy: Wikipedia
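As a quick illustration, here is the sigmoid formula written out directly using plain PyTorch operations (a minimal sketch; nn.Sigmoid() and torch.sigmoid do the same thing):

import torch

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)): squashes any real number into the range (0, 1)
    return 1 / (1 + torch.exp(-x))

x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))    # approximately [0.0000, 0.2689, 0.5000, 0.7311, 1.0000]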
It was one of the first activation functions to be used in an artificial neural network. It is pretty easy to understand but it has two problems that have made it fall out of popularity recently:
It causes gradients to vanish.
When a neuron's activation saturates close to either 0 or 1, the gradient in that region is very close to 0. During backpropagation, this local gradient gets multiplied by the gradient flowing back from the rest of the network. So if the local gradient is very small, it effectively kills the gradient, and almost no signal flows through the neuron to its weights, and recursively to its data (see the short sketch after these two points).
Its output isn’t zero centered.
As I already mentioned, the output of the sigmoid function ranges between 0 and 1. This means the values coming out of a sigmoid layer are always positive, which makes the gradients of the weights in the next layer either all positive or all negative. This can introduce zig-zagging dynamics in the gradient updates, which makes optimization harder.
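Here is a minimal sketch of the first problem. The derivative of sigmoid is f(x)(1 - f(x)), which is at most 0.25 and essentially zero once a unit saturates, so very little gradient flows back through a saturated sigmoid:

import torch

x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)
y = torch.sigmoid(x)
y.sum().backward()

# x.grad holds d(sigmoid)/dx at each point: at most 0.25, and nearly 0 where the unit saturates
print(x.grad)    # approximately [0.0000, 0.1050, 0.2500, 0.1050, 0.0000]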
So have we improved on sigmoid? Well, there is another activation function called the hyperbolic tangent (or tanh). It squashes the input into the range between -1 and 1 instead of 0 and 1, so its output is zero-centered, which makes optimization easier. In practice, tanh is generally preferred to sigmoid, but just like sigmoid it suffers from the vanishing gradient problem.
Image courtesy: Wikipedia
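A quick sketch of the difference in output range between the two functions:

import torch

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
print(torch.sigmoid(x))   # all values in (0, 1), never negative
print(torch.tanh(x))      # values in (-1, 1), centered around 0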
Enter relu, or the rectified linear unit. This activation function has become very popular in the last few years. It's simply max(0, x): the output is 0 when the input is less than 0, and linear with a slope of 1 when the input is greater than 0. In the landmark ImageNet classification paper, Krizhevsky et al. reported roughly a 6x improvement in convergence over tanh. A lot of times in computer science we find that the simplest, most elegant solution is the best, and this applies to relu as well. It doesn't involve expensive operations like tanh or sigmoid, so it learns faster, and it avoids the vanishing gradient problem.
Image courtesy: Wikipedia
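relu is cheap to compute: just an element-wise max with zero, as this small sketch shows (torch.relu and nn.ReLU() are equivalent ways to apply it):

import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(torch.clamp(x, min=0))   # max(0, x): tensor([0.0, 0.0, 0.0, 0.5, 2.0])
print(torch.relu(x))           # identical result with the built-in relu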
Almost all deep networks use relu nowadays, but it is typically only used for the hidden layers. The output layer should use a softmax function for classification problems, since it gives probabilities for the different classes, and a linear function for regression problems.
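For example, here is a minimal sketch of the two kinds of output heads in PyTorch (the layer sizes are arbitrary, chosen only for illustration):

import torch
from torch import nn

features = torch.randn(4, 16)          # a batch of 4 examples with 16 hidden features

# Classification head: linear layer followed by softmax over, say, 3 classes
class_head = nn.Linear(16, 3)
probs = torch.softmax(class_head(features), dim=1)
print(probs.sum(dim=1))                # each row sums to 1: a probability distribution

# Regression head: just a linear layer, no activation on the output
reg_head = nn.Linear(16, 1)
print(reg_head(features).shape)        # unbounded real-valued predictions, one per example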
One problem that relu sometimes has is that some units can be fragile during training and effectively "die". That is, a big gradient flowing through a relu neuron can cause a weight update after which the neuron never activates on any data point again, and from that point on the gradient flowing through it will always be 0.
To fix this problem, a variant called leaky relu was introduced. Instead of the output being 0 when the input is less than 0, it has a small negative slope there. There is also another popular variant called maxout, which is a generalization of both relu and leaky relu. But there is a tradeoff: it doubles the number of parameters for each neuron.
Image courtesy: Wikipedia
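A minimal sketch of the difference: for negative inputs, leaky relu keeps a small slope (0.01 by default in PyTorch) instead of zeroing them out, so some gradient can still flow through the unit:

import torch
from torch import nn

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
print(nn.ReLU()(x))         # tensor([0.0000, 0.0000, 0.0000, 1.0000, 3.0000])
print(nn.LeakyReLU()(x))    # tensor([-0.0300, -0.0100, 0.0000, 1.0000, 3.0000])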
Now, finally, let's discuss which activation function to use in a particular scenario. The answer is relu in most cases, especially in the hidden layers of your neural network. But if you find that a lot of your neurons die during training, then try a variant like leaky relu or maxout. And while relu should be applied to the hidden layers, the output layer should use a softmax for classification or a linear function for regression. There are other activation functions out there, and there is still a lot of room for improvement in this area, but we have covered the most popular options available right now.
Let us train a regression model with each of the activation functions discussed above, using the same training data. The objective is to compare the model's performance as the activation function is changed. Also note that the objective is not to explain how to create or train a regression model; the sole purpose is to see how the results vary when we change the activation function between the model's layers.
We will use PyTorch to define and train the model. However, similar results can easily be reproduced with any other framework, such as Keras, on the same training data. The problem is to predict the price of a house given 79 explanatory variables describing every aspect of the house. The problem is taken from Kaggle, and you can download the training data from its competition page to reproduce the results we get here.
Let’s start by importing the necessary libraries.
import numpy as np
import pandas as pd
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
Now let's read the training data and fill the null values with the median of each column. Here, we assume that train.csv is available in the current folder.
data = pd.read_csv('train.csv')
data = pd.get_dummies(data, dummy_na = True, drop_first = True)
data.drop('Id', axis = 1, inplace = True)
data.fillna(data.median(), inplace = True)
We will need to normalize all columns of the dataset to the range between 0 and 1, except the SalePrice column, since it is the value to be predicted and will not be given as input to the model during training.
columns = data.columns
sale_price = data['SalePrice']
scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(data), columns = columns)
data['SalePrice'] = sale_price
data.head()
This will display the first few rows of the preprocessed dataframe.
The data is now ready to be given as input to the model for training. But before we define our model, let us split the dataset into training and validation sets so that we can monitor the performance while the model is being trained.
X_train, X_val, y_train, y_val = train_test_split(data.drop('SalePrice', axis = 1), sale_price, test_size = 0.2, random_state = 42)
Here, random_state helps reproduce the same train-test split every time we run this program. This is not something you would rely on in a production environment, but it is good for learning purposes. Let us define a simple PyTorch model as given below:
class Regressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(288, 144)
        self.fc2 = nn.Linear(144, 72)
        self.fc3 = nn.Linear(72, 18)
        self.fc4 = nn.Linear(18, 1)
        self.activation = nn.Sigmoid()

    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.activation(self.fc3(x))
        x = self.activation(self.fc4(x))
        return x
Note the value of self.activation in the Regressor class. I have set it to nn.Sigmoid(). This corresponds to the sigmoid activation function we discussed in this article.
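If you would rather not edit the class every time you switch activation functions, one possible variation (a sketch of my own, not part of the original code) is to pass the activation in through the constructor:

class Regressor(nn.Module):
    def __init__(self, activation = nn.Sigmoid()):
        super().__init__()
        self.fc1 = nn.Linear(288, 144)
        self.fc2 = nn.Linear(144, 72)
        self.fc3 = nn.Linear(72, 18)
        self.fc4 = nn.Linear(18, 1)
        self.activation = activation   # swap in nn.Tanh(), nn.ReLU(), nn.LeakyReLU(), ...

    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.activation(self.fc3(x))
        x = self.activation(self.fc4(x))
        return x

model = Regressor(activation = nn.ReLU())   # for example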
Now, let's quickly set up the model and define a simple training loop before we start the training.
# Split the training data into 50 mini-batches and convert them to tensors
train_batch = np.array_split(X_train, 50)
label_batch = np.array_split(y_train, 50)
for i in range(len(train_batch)):
    train_batch[i] = torch.from_numpy(train_batch[i].values).float()
for i in range(len(label_batch)):
    label_batch[i] = torch.from_numpy(label_batch[i].values).float().view(-1, 1)

X_val = torch.from_numpy(X_val.values).float()
y_val = torch.from_numpy(y_val.values).float().view(-1, 1)

model = Regressor()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr = 0.001)
epochs = 10

train_losses, test_losses = [], []
for e in range(epochs):
    model.train()
    train_loss = 0
    for i in range(len(train_batch)):
        optimizer.zero_grad()
        output = model(train_batch[i])
        # Root mean squared error between the logs of predictions and labels
        loss = torch.sqrt(criterion(torch.log(output), torch.log(label_batch[i])))
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    # Evaluate on the validation set with gradients disabled
    test_loss = 0
    with torch.no_grad():
        model.eval()
        predictions = model(X_val)
        test_loss += torch.sqrt(criterion(torch.log(predictions), torch.log(y_val))).item()

    train_losses.append(train_loss / len(train_batch))
    test_losses.append(test_loss)
    print("Epoch: {}/{}.. ".format(e + 1, epochs),
          "Training Loss: {:.3f}.. ".format(train_loss / len(train_batch)),
          "Test Loss: {:.3f}.. ".format(test_loss))
Note that we are training the model for only 10 epochs, just for the purpose of a demo. You can set this variable to any appropriate value. Here is the train and test loss of the model after 10 epochs of training (with sigmoid as the activation function between the linear layers):
Similarly, you can set the self.activation variable in the Regressor class to the following values and retrain the model from scratch (see the sketch after this list for one way to automate the comparison):
nn.Tanh()
nn.ReLU()
nn.LeakyReLU()
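As referenced above, here is a minimal sketch of how you could compare all four activations in one run. It assumes the parameterized Regressor variant sketched earlier (my own addition, not the original class) and that the tensors train_batch, label_batch, X_val and y_val from the training code above already exist; the loss is the same one used in the loop above.

# Train a fresh model per activation and report its final validation loss
activations = {
    "Sigmoid": nn.Sigmoid(),
    "Tanh": nn.Tanh(),
    "ReLU": nn.ReLU(),
    "LeakyReLU": nn.LeakyReLU(),
}

for name, activation in activations.items():
    model = Regressor(activation = activation)
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr = 0.001)
    for e in range(10):
        model.train()
        for i in range(len(train_batch)):
            optimizer.zero_grad()
            loss = torch.sqrt(criterion(torch.log(model(train_batch[i])),
                                        torch.log(label_batch[i])))
            loss.backward()
            optimizer.step()
    with torch.no_grad():
        model.eval()
        val_loss = torch.sqrt(criterion(torch.log(model(X_val)), torch.log(y_val))).item()
    print("{}: validation loss {:.3f}".format(name, val_loss))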
Here is a screenshot of the training output for the LeakyReLU activation function:
Compare this training loss with the training loss of sigmoid. A significant improvement, isn't it?
So, that was it! I hope you have gained a new perspective on one of the important hyperparameters of modern deep learning models.